Skip to content

OADP-7541: fix non-deterministic matchExpressions ordering causing node-agent restarts#2234

Open
kaovilai wants to merge 4 commits into
openshift:oadp-devfrom
kaovilai:OADP-7541-fix-nodeagent-matchexpressions-ordering
Open

OADP-7541: fix non-deterministic matchExpressions ordering causing node-agent restarts#2234
kaovilai wants to merge 4 commits into
openshift:oadp-devfrom
kaovilai:OADP-7541-fix-nodeagent-matchexpressions-ordering

Conversation

@kaovilai

@kaovilai kaovilai commented Jun 8, 2026

Copy link
Copy Markdown
Member

Summary

  • When a DPA configures nodeAgent.podConfig.nodeSelector with multiple labels, kube.ToSystemAffinity() iterates Go maps in random order to build matchExpressions. Kubernetes treats arrays as ordered, so each order flip registers as a spec change, triggering DaemonSet rollouts every reconciliation (~30s). Customer observed generation 13858 on node-agent while velero deployment stayed at 2.
  • Adds common.SortNodeSelectorTerms() helper that sorts MatchExpressions by key within each NodeSelectorTerm
  • Applied to both node-agent DaemonSet and Velero Deployment code paths
  • Unit tests run 100 iterations with multi-label selectors to verify deterministic ordering

Test plan

  • New unit tests pass: TestNodeAgentAffinityMatchExpressionsDeterministicOrder (3 sub-cases × 100 iterations)
  • New unit tests pass: TestVeleroAffinityMatchExpressionsDeterministicOrder (100 iterations)
  • Tests run with -count=5 for additional confidence
  • All files compile cleanly
  • E2E: deploy with multi-label nodeSelector, verify node-agent generation stays stable

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Node affinity match expressions for NodeAgent and Velero are now deterministically ordered, preventing non-deterministic pod/deployment spec diffs and ensuring consistent scheduling.
  • Tests

    • Added unit tests that repeatedly validate deterministic ordering of node affinity match expressions across multi-label scenarios to guard against nondeterminism.

…de-agent restarts

When a DPA configures nodeAgent.podConfig.nodeSelector with multiple
labels, kube.ToSystemAffinity() iterates Go maps in random order to
build matchExpressions. Kubernetes treats arrays as ordered, so each
order flip registers as a spec change, triggering DaemonSet rollouts
every reconciliation (~30s). Customer observed generation 13858 on
node-agent while velero deployment stayed at 2.

Sort matchExpressions by key after building them from LoadAffinityConfig
to ensure deterministic ordering. Applied to both node-agent DaemonSet
and Velero Deployment code paths.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 8, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 8, 2026

Copy link
Copy Markdown

@kaovilai: This pull request references OADP-7541 which is a valid jira issue.

Details

In response to this:

Summary

  • When a DPA configures nodeAgent.podConfig.nodeSelector with multiple labels, kube.ToSystemAffinity() iterates Go maps in random order to build matchExpressions. Kubernetes treats arrays as ordered, so each order flip registers as a spec change, triggering DaemonSet rollouts every reconciliation (~30s). Customer observed generation 13858 on node-agent while velero deployment stayed at 2.
  • Adds common.SortNodeSelectorTerms() helper that sorts MatchExpressions by key within each NodeSelectorTerm
  • Applied to both node-agent DaemonSet and Velero Deployment code paths
  • Unit tests run 100 iterations with multi-label selectors to verify deterministic ordering

Test plan

  • New unit tests pass: TestNodeAgentAffinityMatchExpressionsDeterministicOrder (3 sub-cases × 100 iterations)
  • New unit tests pass: TestVeleroAffinityMatchExpressionsDeterministicOrder (100 iterations)
  • Tests run with -count=5 for additional confidence
  • All files compile cleanly
  • E2E: deploy with multi-label nodeSelector, verify node-agent generation stays stable

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6d17cc83-e233-409c-a684-1b7d35aeebb0

📥 Commits

Reviewing files that changed from the base of the PR and between 7c88cb6 and 516587c.

📒 Files selected for processing (1)
  • pkg/common/common.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/common/common.go

Walkthrough

Sort NodeSelectorTerm MatchExpressions/MatchFields by requirement key via new common.SortNodeSelectorTerms; apply the sorter in NodeAgent and Velero affinity assembly and add tests asserting deterministic ordering.

Changes

Deterministic node affinity ordering

Layer / File(s) Summary
Sorting utility for node selector terms
pkg/common/common.go
Add cmp and slices imports and implement SortNodeSelectorTerms to sort each corev1.NodeSelectorTerm's MatchExpressions and MatchFields by requirement Key.
NodeAgent DaemonSet affinity determinism
internal/controller/nodeagent.go, internal/controller/nodeagent_test.go
Call common.SortNodeSelectorTerms(terms) in ReconcileNodeAgentDaemonset before applying to DaemonSet affinity; add TestNodeAgentAffinityMatchExpressionsDeterministicOrder to verify sorted ordering across multiple label scenarios run 100 times each.
Velero Deployment affinity determinism
internal/controller/velero.go, internal/controller/velero_test.go
Call common.SortNodeSelectorTerms(terms) in customizeVeleroDeployment before applying to pod affinity; import testify/require, adjust expected MatchExpressions ordering in an existing complex-case test, and add TestVeleroAffinityMatchExpressionsDeterministicOrder to assert deterministic ordering across iterations.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (14 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: fixing non-deterministic matchExpressions ordering that caused node-agent restarts, directly matching the changeset's core objective.
Description check ✅ Passed The PR description provides comprehensive context on the problem, the solution, and testing approach, covering both required template sections with specific technical details.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names are static and deterministic with no dynamic information (UUIDs, timestamps, generated suffixes, pod/node names, or IPs). Test titles clearly describe what they validate.
Test Structure And Quality ✅ Passed Custom check requires review of Ginkgo test code, but PR adds standard Go tests using testing.T pattern, not Ginkgo (no Describe/It/BeforeEach/AfterEach blocks). Check not applicable.
Microshift Test Compatibility ✅ Passed The added tests are standard Go unit tests (using testing.T), not Ginkgo e2e tests. The check applies only to Ginkgo e2e tests; therefore it is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed The PR adds standard Go unit tests (not Ginkgo e2e tests). The check applies only to Ginkgo tests (It(), Describe(), Context(), When()), so it does not apply here.
Topology-Aware Scheduling Compatibility ✅ Passed PR adds deterministic sorting to user-provided LoadAffinityConfig. No new scheduling constraints, hardcoded node selectors, or topology assumptions introduced.
Ote Binary Stdout Contract ✅ Passed PR adds deterministic sorting logic and tests with no process-level stdout writes. All changes are contained within regular functions and test code that doesn't violate OTE Binary Stdout Contract.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR adds only standard Go unit tests (func Test*), not Ginkgo e2e tests. Tests operate on in-memory Kubernetes objects with no IPv4 assumptions or external connectivity requirements.
No-Weak-Crypto ✅ Passed PR adds sorting of Kubernetes NodeSelectorTerms using standard library functions; no weak crypto patterns, custom crypto implementations, or insecure secret comparisons detected.
Container-Privileges ✅ Passed PR introduces only NodeSelectorTerms sorting logic for determinism; no privileged, hostPID, hostNetwork, hostIPC, SYS_ADMIN capabilities, or allowPrivilegeEscalation settings are added.
No-Sensitive-Data-In-Logs ✅ Passed No new logging statements that expose sensitive data were added. The PR adds only sorting function calls with no logging or debug output.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from mrnold and sseago June 8, 2026 20:54
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 8, 2026
@kaovilai

kaovilai commented Jun 8, 2026

Copy link
Copy Markdown
Member Author

/cherry-pick oadp-1.6

@openshift-cherrypick-robot

Copy link
Copy Markdown
Contributor

@kaovilai: once the present PR merges, I will cherry-pick it on top of oadp-1.6 in a new PR and assign it to you.

Details

In response to this:

/cherry-pick oadp-1.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kaovilai

kaovilai commented Jun 8, 2026

Copy link
Copy Markdown
Member Author

/cherry-pick oadp-1.5

@openshift-cherrypick-robot

Copy link
Copy Markdown
Contributor

@kaovilai: once the present PR merges, I will cherry-pick it on top of oadp-1.5 in a new PR and assign it to you.

Details

In response to this:

/cherry-pick oadp-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

kaovilai and others added 2 commits June 9, 2026 12:10
Sort expected matchExpressions alphabetically by key to match the
deterministic ordering introduced in the previous commit.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
@kaovilai

kaovilai commented Jun 9, 2026

Copy link
Copy Markdown
Member Author

/retest-required

Note

Responses generated with Claude

5.0-e2e-test-aws failed due to infrastructure flake — test pod received SIGTERM 1s after startup (exit code 130), no tests executed. All other E2E jobs passed.

Comment thread internal/controller/nodeagent.go
Comment thread pkg/common/common.go
Address review feedback: sort both MatchExpressions and MatchFields
by key for completeness and future-proofing.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
@kaovilai

Copy link
Copy Markdown
Member Author

/retest

@openshift-ci

openshift-ci Bot commented Jun 11, 2026

Copy link
Copy Markdown

@kaovilai: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@weshayutin

Copy link
Copy Markdown
Contributor

@PrasadJoshi12 do you have an automated test case for this?

@kaovilai

kaovilai commented Jun 18, 2026

Copy link
Copy Markdown
Member Author

E2E Verification: OADP-7541 fix confirmed

Environment

  • Cluster: OCP 5.0.0-ec.2 on AWS (arm64)
  • Operator image: ttl.sh/oadp-operator-516587c3:1h (built from this PR branch, commit 516587c3)
  • Install method: OLMv1 ClusterExtension with ClusterCatalog pointing to ttl.sh/oadp-operator-catalog-516587c3:1h

Test: Multi-label nodeSelector stability

Created DPA with two nodeSelector labels (reproducing the customer's multi-label config from OADP-7541):

spec:
  configuration:
    nodeAgent:
      enable: true
      uploaderType: kopia
      podConfig:
        nodeSelector:
          kubernetes.io/os: linux
          kubernetes.io/arch: arm64

Result: PASS

matchExpressions correctly sorted by key:

[
  {"key": "kubernetes.io/arch", "operator": "In", "values": ["arm64"]},
  {"key": "kubernetes.io/os", "operator": "In", "values": ["linux"]}
]

Generation monitoring over 5 minutes (10 checks at 30s intervals):

16:17:58: node-agent gen=1 observed=1 | velero gen=1 observed=1
16:18:29: node-agent gen=1 observed=1 | velero gen=1 observed=1
16:18:59: node-agent gen=1 observed=1 | velero gen=1 observed=1
16:19:30: node-agent gen=1 observed=1 | velero gen=1 observed=1
16:20:01: node-agent gen=1 observed=1 | velero gen=1 observed=1
16:20:32: node-agent gen=1 observed=1 | velero gen=1 observed=1
16:21:03: node-agent gen=1 observed=1 | velero gen=1 observed=1
16:21:34: node-agent gen=1 observed=1 | velero gen=1 observed=1
16:22:04: node-agent gen=1 observed=1 | velero gen=1 observed=1
16:22:35: node-agent gen=1 observed=1 | velero gen=1 observed=1
  • node-agent DaemonSet generation stayed at 1 (before the fix, customer saw generation 13858)
  • velero Deployment generation also stable at 1
  • Zero node-agent/daemonset update entries in operator logs over the 5-minute window

Note

Responses generated with Claude

@PrasadJoshi12

Copy link
Copy Markdown
Contributor

@PrasadJoshi12 do you have an automated test case for this?

No, we need to add one

@openshift-ci

openshift-ci Bot commented Jun 29, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kaovilai, sseago

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants